Abstract

The concept of overfitting in model selection is explained and demonstrated.
After providing some background information on information theory and
Kolmogorov complexity, we provide a short explanation of Minimum Description
Length and error minimization. We conclude with a discussion of the typical
features of overfitting in model selection.

1 The paradox of overfitting

Machine learning is the branch of Artificial Intelligence that deals with learning
algorithms. Learning is a figurative description of what in ordinary science
is also known as model selection and generalization. In computer science a
model is a set of binary encoded values or strings, often the parameters of a
function or statistical distribution. Models that parameterize the same function
or distribution are called a family. Models of the same family are usually indexed
by the number of parameters involved. This number of parameters is also called
the degree or the dimension of the model.

To learn some real world phenomenon means to take some examples of the
phenomenon and to select a model that describes them well. When such a model
can also be used to describe instances of the same phenomenon that it was not
trained on we say that it generalizes well or that it has a small generalization
error. The task of a learning algorithm is to minimize this generalization error.

Classical learning algorithms did not allow for logical dependencies [MP69] and
were not very interesting to Artificial Intelligence. The advance of techniques 
like neural networks with back-propagation in the 1980's and Bayesian networks 
in the 1990's has changed this profoundly. With such techniques it is possible 
to learn very complex relations. Learning algorithms are now extensively used 
in applications like expert systems, computer vision and language recognition. 
Machine learning has earned itself a central position in Artificial Intelligence. 

A serious problem of most of the common learning algorithms is overfitting. 
Overfitting occurs when the models describe the examples better and better 
but get worse and worse on other instances of the same phenomenon. This can
make the whole learning process worthless. A good way to observe overfitting 
is to split a number of examples in two, a training set, and a test set and 
to train the models on the training set. Clearly, the higher the degree of the 
model, the more information the model will contain about the training set. But 
when we look at the generalization error of the models on the test set, we will 
usually see that after an initial phase of improvement the generalization error 
suddenly becomes catastrophically bad. To the uninitiated student this takes 
some effort to accept since it apparently contradicts the basic empirical truth 
that more information will not lead to worse predictions. We may well call this 
the paradox of overfitting.

It might seem at first that overfitting is a problem specific to machine learning 
with its use of very complex models. And as some model families suffer less 
from overfitting than others the ultimate answer might be a model family that 
is entirely free from overfitting. But overfitting is a very general problem that 
has been known to statistics for a long time. And as overfitting is not the only 
constraint on models it will not be solved by searching for model families that 
are entirely free of it. Many families of models are essential to their field because
of their speed, accuracy, ease of mathematical treatment, and other properties that are
unlikely to be matched by an equivalent family that is free from overfitting.
As an example, polynomials are used widely throughout all of science because 
of their many algorithmic advantages. They suffer very badly from overfitting. 
ARMA models are essential to signal processing and are often used to model 
time series. They also suffer badly from overfitting. If we want to use the model 
with the best algorithmic properties for our application we need a theory that 
can select the best model from any arbitrary family. 

2 An example of overfitting 

Figure 1 gives a good example of overfitting. The upper graph shows
two curves in the two-dimensional plane. One of the curves is a segment of the 
Lorenz attractor, the other a 43-degree polynomial. A Lorenz attractor is a 
complicated self similar object. Here it is only important because it is definitely 
not a polynomial and because its curve is relatively smooth. Such a curve can 
be approximated well by a polynomial. 

An n-degree polynomial is a function of the form 

$$f(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_n x^n, \quad x \in \mathbb{R} \qquad (1)$$

with an $n + 1$-dimensional parameter space $(a_0 \ldots a_n) \in \mathbb{R}^{n+1}$.

A polynomial is very easy to work with and polynomials are used throughout 
science to model (or approximate) other functions. If the other function has to 
be inferred from a sample of points that witness that function, the problem is 
called a regression problem. 

Based on a small training sample that witnesses our Lorenz attractor we search 
for a polynomial that optimally predicts future points that follow the same 

Figure 1: An example of overfitting




Lorenz attractor and the optimum 43-degree polynomial (the curve with smaller 
oscillations). The points are the 300 point training sample and the 3,000 point test 
sample. Both samples are independently identically distributed. The distribution 
over the x-axis is uniform over the support interval [0, 10]. Along the y-axis, the 
deviation from the Lorenz attractor is Gaussian with variance $\sigma^2 = 1$.




Generalization (mean squared error on the test set) analysis for polynomials of
degree 0-60. The x-axis shows the degree of the polynomial. The y-axis shows the
generalization error on the test sample. It has logarithmic scale.

The first value on the left is the 0-degree polynomial. It has a mean squared error
of $\sigma^2 = 18$ on the test sample. To the right of it the generalization error slowly
decreases until it reaches a global minimum of $\sigma^2 = 2.7$ at 43 degrees. After this
the error shows a number of steep inclines and declines with local maxima that soon
are much worse than the initial $\sigma^2 = 18$.

distribution as the training sample: they witness the same Lorenz attractor,
the same noise along the y-axis and the same distribution over the x-axis. Such 
a sample is called i.i.d., independently identically distributed. Here the i.i.d. 
assumption will be the only assumption about training samples and samples 
that have to be predicted. 

The Lorenz attractor in the graph is witnessed by 3,300 points. To simulate 
the noise that is almost always polluting our measurements, the points deviate 
from the curve of the attractor by a small distance along the y-axis. They are 
uniformly distributed over the interval [0, 10] of the x-axis and are randomly 
divided into a 300 point training set and a 3,000 point test set. The interval 
[0, 10] of the x-axis is called the support. 

The generalization analysis in the lower graph of Figure 1 shows what happens
if we approximate the 300 point training set by polynomials of rising degree 
and measure the generalization error of these polynomials on the 3,000 point 
test set. Of course, the more parameters we choose, the better the polynomial 
will approximate the training set until it eventually goes through every single 
point of the training set. This is not shown in the graph. What is shown is the 
generalization error on the 3,000 points of the i.i.d. test set. The x-axis shows 
the degrees of the polynomial and the y-axis shows the generalization error. 

Starting on the left with a 0-degree polynomial (which is nothing but the mean 
of the training set) we see that a polynomial that approximates the training set 
well will also approximate the test set. Slowly but surely, the more parameters 
the polynomial uses the smaller the generalization error becomes. In the center
of the graph, at 43 degrees, the generalization error reaches its global minimum. But
then something unexpected happens, at least in the eyes of the uninitiated 
student. For polynomials of 44 degrees and higher the error on the test set 
rises very fast and soon becomes much bigger than the generalization error of 
even the 0-degree polynomial. Though these high degree polynomials continue 
to improve on the training set, they definitely do not approximate our Lorenz 
attractor any more. They overfit. 
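
The experiment described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function `target` is a hypothetical smooth stand-in for the Lorenz attractor segment (the original curve is not reproduced here), the random seed and the degree range 0-30 are arbitrary, and NumPy's least squares polynomial fit replaces whatever fitting routine was used for Figure 1. How sharply the test error rises at high degrees depends on these choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def target(x):
    # Hypothetical smooth curve standing in for the Lorenz attractor segment.
    return 3.0 * np.sin(1.5 * x) + 0.5 * x

def sample(n):
    # x uniform over the support [0, 10], Gaussian noise of variance 1 along y.
    x = rng.uniform(0.0, 10.0, n)
    y = target(x) + rng.normal(0.0, 1.0, n)
    return x, y

x_train, y_train = sample(300)
x_test, y_test = sample(3000)

def errors(degree):
    # Polynomial.fit rescales x internally, which keeps the least squares
    # problem better conditioned for higher degrees.
    p = np.polynomial.Polynomial.fit(x_train, y_train, degree)
    train_mse = np.mean((p(x_train) - y_train) ** 2)
    test_mse = np.mean((p(x_test) - y_test) ** 2)
    return train_mse, test_mse

# Map each degree to (training error, generalization error on the test set).
results = {d: errors(d) for d in range(0, 31)}
best_degree = min(results, key=lambda d: results[d][1])

for d in (0, 5, 10, 20, 30):
    print(d, results[d])
print("degree with lowest test error:", best_degree)
```

The training error can only shrink as the degree grows, while the degree that minimizes the test error has to be read off the held-out sample.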

3 The definition of a good model 

Before we can proceed with a more detailed analysis of model selection we need 
to answer one important question: what exactly is a good model? And one
popular belief which is persistent even among professional statisticians has to 
be dismissed right from the beginning: the model that will achieve the lowest 
generalization error does not have to have the same degree or even be of the 
same family as the model that originally produced the data. 

To drive this idea home we use a simple 4-degree polynomial as a source function.
This polynomial is witnessed by a 100 point training sample and a 3,000
point test sample. To simulate noise, the points are polluted by a Gaussian
distribution of variance $\sigma^2 = 1$ along the y-axis. Along the x-axis they are
uniformly distributed over the support interval [0, 10]. The graph of this example
and the analysis of the generalization error are shown in Figure 2. The
generalization error shows that a 4-degree polynomial has a comparatively high
generalization error. When trained on a sample of this size and noise there is
only a very low probability that a 4-degree polynomial will ever show a satisfactory
generalization error. Depending on the actual training sample the lowest
generalization error is achieved for polynomials from 6 to 8 degrees.

This discrepancy is not biased by inaccurate algorithms. Neither can it be dismissed
as the result of an unfortunate selection of sample size, noise and model
family. The same phenomenon can be witnessed for ARMA models and many
others under many different circumstances but especially for small sample sizes.
In [Ruc89] a number of striking examples are given of rather innocent functions
the output of which cannot reasonably be approximated by any function of the
same family. Usually this happens when the output is very sensitive to minimal
changes in the parameters. Still, the attractor of such a function can often be
parameterized surprisingly well by a very different family of functions.¹

For most practical purposes a good model is a model that minimizes the generalization
error on future output of the process in question. But in the absence
of further output even this is a weak definition. We might want to filter useful
information from noise or to compress an overly redundant file into a more convenient
format, as is often the case in video and audio applications. In this case
we need to select a model for which the data is most typical in the sense that
the data is a truly random member of the model and virtually indistinguishable
from all its other members, except for the noise. It implies that all information
that has been lost during filtering or lossy compression was noise of a truly
random nature. This definition of a good model is entirely independent from a
source and is known as minimum randomness deficiency. It will be discussed in
more detail in Section 4.

We now have three definitions of a good model:

1. identifying family and degree of the original model for reconstruction 
purposes 

2. minimum generalization error for data prediction 

3. randomness deficiency for filters and lossy compression 

We have already seen that a model of the same family and degree as the original 
model does not necessarily minimize the generalization error. 



The important question is: can the randomness deficiency 
and the generalization error be minimized by the same model 
selection method? 



1 To add to the confusion, a function that accurately describes an attractor is often advocated
as the original function. This can be compared to confusing a fingerprint with a DNA string.
Both are unique identifiers of their bearer but only one contains his blueprint.

Figure 2: Defining a good model

Original 4-degree polynomial (green), 100 point training sample, 3,000 point test
sample and 8-degree polynomial trained on the training sample (blue). In case you
are reading a black and white print: the 8-degree polynomial lies above the 4-degree
polynomial at the left peak and the middle valley and below the 4-degree polynomial
at the right peak.

The analysis of the generalization error $\sigma^2$. A 0-degree polynomial achieves $\sigma^2 = 14$
and a 4-degree polynomial $\sigma^2 = 3.4$ on the test sample. All polynomials in the range
6-18 degrees achieve $\sigma^2 < 1.3$ with a global minimum of $\sigma^2 = 1.04$ at 8 degrees.
From 18 degrees onwards we witness overfitting. Different training samples of the
same size might witness global minima for polynomials ranging from 6 to 8 degrees
and overfitting may start from 10 degrees onwards. 4 degrees are always far worse
than 6 degrees. The y-axis has logarithmic scale.

Such a general purpose method would simplify teaching and would enable many
more people to deal with the problems of model selection. A general purpose
method would also be very attractive to the embedded systems industry. Embedded
systems often hardwire algorithms and cannot adapt them to specific
needs. They have to be very economic with time, space and energy consumption.
An algorithm that can effectively filter, compress and predict future data
all at the same time would indeed be very useful. But before this question can
be answered we have to introduce some mathematical theory.

4 Information & complexity theory

This section provides a raw overview of the essential concepts. The interested 
reader is referred to the literature, especially the textbooks 

Elements of Information Theory 

by Thomas M. Cover and Joy A. Thomas, [CT91] 

Introduction to Kolmogorov Complexity and Its Applications 

by Ming Li and Paul Vitanyi, [LV97]

which cover the fields of information theory and Kolmogorov complexity in 
depth and with all the necessary rigor. They are very readable and require only
a minimum of prior knowledge. 

Kolmogorov complexity. The concept of Kolmogorov complexity was developed
independently and with different motivation by Andrei N. Kolmogorov
[Kol65], Ray Solomonoff [Sol64] and Gregory Chaitin [Cha66], [Cha69].²

The Kolmogorov complexity $C(s)$ of any binary string $s \in \{0, 1\}^n$ is the length of
the shortest computer program $s^*$ that can produce this string on the Universal
Turing Machine UTM and then halt. In other words, on the UTM $C(s)$ bits of
information are needed to encode $s$. The UTM is not a real computer but an
imaginary reference machine. We don't need the specific details of the UTM.
As every Turing machine can be implemented on every other one, the minimum
length of a program on one machine will only add a constant to the minimum
length of the program on every other machine. This constant is the length of the
implementation of the first machine on the other machine and is independent
of the string in question. This was first observed in 1964 by Ray Solomonoff.

Experience has shown that every attempt to construct a theoretical model of 
computation that is more powerful than the Turing machine has come up with 
something that is at the most just as strong as the Turing machine. This was
codified in 1936 by Alonzo Church as Church's Thesis: the class of algorithmically
computable numerical functions coincides with the class of partial
recursive functions. Everything we can compute we can compute by a Turing



2 Kolmogorov complexity is sometimes also called algorithmic complexity and Turing complexity.
Though Kolmogorov was not the first one to formulate the idea, he played the
dominant role in the consolidation of the theory.

machine and what we cannot compute by a Turing machine we cannot compute
at all. This said, we can use Kolmogorov complexity as a universal measure 
that will assign the same value to any sequence of bits regardless of the model 
of computation, within the bounds of an additive constant. 

Incomputability of Kolmogorov complexity. Kolmogorov complexity is 
not computable. It is nevertheless essential for proving existence and bounds 
for weaker notions of complexity. The fact that Kolmogorov complexity cannot 
be computed stems from the fact that we cannot compute the output of every 
program. More fundamentally, no algorithm is possible that can predict for every
program whether it will ever halt, as has been shown by Alan Turing in his famous
work on the halting problem [Tur36]. No computer program is possible that,
when given any other computer program as input, will always output true if 
that program will eventually halt and false if it will not. Even if we have a 
short program that outputs our string and that seems to be a good candidate for 
being the shortest such program, there is always a number of shorter programs 
of which we do not know if they will ever halt and with what output. 
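
Although the shortest program cannot be found, any program that does reproduce the string gives an upper bound on its complexity. The following sketch illustrates this with a general purpose compressor. It is an assumption of this example, not part of the theory above, that `zlib` is a reasonable stand-in: the compressed length bounds the Kolmogorov complexity only up to the constant size of the decompressor, and it can never certify that no shorter description exists.

```python
import os
import zlib

def compressed_bits(data: bytes) -> int:
    # Length in bits of one particular description of `data`:
    # an upper bound on its complexity, up to the size of the decompressor.
    return 8 * len(zlib.compress(data, 9))

regular = b"01" * 5000           # 10,000 bytes with an obvious regularity
random_like = os.urandom(10000)  # 10,000 bytes from the OS entropy source

regular_bits = compressed_bits(regular)
random_bits = compressed_bits(random_like)

print("regular string:", 8 * len(regular), "bits raw,", regular_bits, "bits compressed")
print("random string: ", 8 * len(random_like), "bits raw,", random_bits, "bits compressed")
```

The regular string shrinks to a small fraction of its raw length, whereas the compressor finds nothing to exploit in the random bytes.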

Plain versus prefix complexity. Turing's original model of computation 
included special delimiters that marked the end of an input string. This has 
resulted in two brands of Kolmogorov complexity: 

plain Kolmogorov complexity: the length $C(s)$ of the shortest binary
string that is delimited by special marks and that can compute $s$ on
the UTM and then halt.

prefix Kolmogorov complexity: the length $K(s)$ of the shortest binary
string that is self-delimiting [LV97] and that can compute $s$ on the
UTM and then halt.

The difference between the two is logarithmic in C(s): the number of extra bits 
that are needed to delimit the input string. While plain Kolmogorov complexity 
integrates neatly with the Turing model of computation, prefix Kolmogorov 
complexity has a number of desirable mathematical characteristics that make 
it a more coherent theory. The individual advantages and disadvantages are 
described in [LV97]. Which one is actually used is a matter of convenience. We
will mostly use the prefix complexity K(s). 

Individual randomness. A. N. Kolmogorov was interested in Kolmogorov 
complexity to define the individual randomness of an object. When s has no 
computable regularity it cannot be encoded by a program shorter than s. Such 
a string is truly random and its Kolmogorov complexity is the length of the 
string itself plus the command print.³ And indeed, strings with a Kolmogorov
complexity close to their actual length satisfy all known tests of randomness. A 
regular string, on the other hand, can be computed by a program much shorter 
than the string itself. But the overwhelming majority of all strings of any length 
are random and for a string picked at random chances are exponentially small 
that its Kolmogorov complexity will be significantly smaller than its actual 
length. 



3 Plus a logarithmic term if we use prefix complexity.

This can easily be shown. For any given integer $n$ there are exactly $2^n$ binary
strings of that length and $2^n - 1$ strings that are shorter than $n$: one empty
string, $2^1$ strings of length one, $2^2$ of length two and so forth. Even if all strings
shorter than $n$ would produce a string of length $n$ on the UTM we would still
be one string short of assigning a $C(s) < n$ to every single one of our $2^n$ strings.
And if we want to assign a $C(s) < n - 1$ we can maximally do so for $2^{n-1} - 1$
strings. And for $C(s) < n - 10$ we can only do so for $2^{n-10} - 1$ strings which
is less than 0.1% of all our strings. Even under optimal circumstances we will
never find a $C(s) < n - c$ for more than $\frac{1}{2^c}$ of our strings.
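
The counting argument can be checked with exact arithmetic. The sketch below only counts candidate programs; the helper name `max_fraction_compressible` and the choice $n = 64$ are illustrative assumptions.

```python
from fractions import Fraction

def max_fraction_compressible(n: int, c: int) -> Fraction:
    """Upper bound on the fraction of n-bit strings with C(s) < n - c."""
    # Binary strings of length 0, 1, ..., n - c - 1: there are 2^(n-c) - 1 of
    # them, and each can be the shortest program of at most one string.
    shorter_programs = 2 ** (n - c) - 1
    return Fraction(shorter_programs, 2 ** n)

n = 64
bounds = {c: max_fraction_compressible(n, c) for c in (0, 1, 10, 20)}

for c, bound in bounds.items():
    print(f"c = {c:2d}: at most {float(bound):.3e} of all {n}-bit strings")
```

For every $c$ the bound stays strictly below $2^{-c}$, and for $c = 10$ it is just under one in a thousand.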

Conditional Kolmogorov complexity. The conditional Kolmogorov complexity
$K(s|a)$ is defined as the length of the shortest program that can output $s$ on the UTM
if the input string $a$ is given on an auxiliary tape. $K(s)$ is the special case $K(s|\epsilon)$
where the auxiliary tape is empty.

The universal distribution. When Ray Solomonoff first developed Kolmogorov
complexity in 1964 he intended it to define a universal distribution
over all possible objects. His original approach dealt with a specific problem of
Bayes' rule, the unknown prior distribution. Bayes' rule can be used to calculate
$P(m|s)$, the probability for a probabilistic model to have generated the sample
$s$, given $s$. It is very simple. $P(s|m)$, the probability that the sample will occur
given the model, is multiplied by the unconditional probability that the model
will apply at all, $P(m)$. This is divided by the unconditional probability of the
sample $P(s)$. The unconditional probability of the model is called the prior
distribution and the probability that the model will have generated the data is
called the posterior distribution.

$$P(m|s) = \frac{P(s|m) \, P(m)}{P(s)} \qquad (2)$$

Bayes' rule can easily be derived from the definition of conditional probability:

$$P(m|s) = \frac{P(m, s)}{P(s)} \qquad (3)$$

and

$$P(s|m) = \frac{P(m, s)}{P(m)} \qquad (4)$$
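
As a small worked instance of Bayes' rule, consider two hypothetical coin models and a sample of ten tosses. The models, the uniform prior and the sample are all assumptions made up for this illustration.

```python
from fractions import Fraction

# Two hypothetical models of a coin and an assumed uniform prior P(m).
p_heads = {"fair": Fraction(1, 2), "biased": Fraction(9, 10)}
prior = {"fair": Fraction(1, 2), "biased": Fraction(1, 2)}

def likelihood(model: str, heads: int, tails: int) -> Fraction:
    # P(s|m) for one particular sequence with the given counts.
    p = p_heads[model]
    return p ** heads * (1 - p) ** tails

heads, tails = 8, 2  # the observed sample s

joint = {m: likelihood(m, heads, tails) * prior[m] for m in prior}  # P(m, s)
evidence = sum(joint.values())                                      # P(s)
posterior = {m: joint[m] / evidence for m in joint}                 # P(m|s)

for m, p in posterior.items():
    print(m, float(p))
```

The posterior depends on the prior through the product $P(s|m) P(m)$, which is why an unknown prior is a genuine obstacle.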



The big and obvious problem with Bayes' rule is that we usually have no idea
what the prior distribution $P(m)$ should be. Solomonoff suggested that if the
true prior distribution is unknown the best assumption would be the universal
distribution $2^{-K(m)}$ where $K(m)$ is the prefix Kolmogorov complexity of the
model.⁴ This is nothing but a modern codification of the age old principle that
is widely known under the name of Occam's razor: the simplest explanation is
the most likely one to be true.



4 Originally Solomonoff used the plain Kolmogorov complexity $C(\cdot)$. This resulted in an
improper distribution $2^{-C(m)}$ that tends to infinity. Only in 1974 L.A. Levin introduced
prefix complexity to solve this particular problem, and thereby many other problems as
well [Lev74].

Entropy. Claude Shannon [Sha48] developed information theory in the late
1940's. He was concerned with the optimum code length that could be given to
different binary words $w$ of a source string $s$. Obviously, assigning a short code
length to low frequency words or a long code length to high frequency words
is a waste of resources. Suppose we draw a word $w$ from our source string $s$
uniformly at random. Then the probability $p(w)$ is equal to the frequency of $w$
in $s$. Shannon found that the optimum overall code length for $s$ was achieved
when assigning to each word $w$ a code of length $-\log p(w)$. Shannon attributed
the original idea to R.M. Fano and hence this code is called the Shannon-Fano
code. When using such an optimal code, the average code length of the words
of $s$ can be reduced to

$$H(s) = -\sum_{w} p(w) \log p(w) \qquad (5)$$

where $H(s)$ is called the entropy of the set $s$. When $s$ is finite and we assign a
code of length $-\log p(w)$ to each of the $n$ words of $s$, the total code length is

$$-\sum_{w \in s} \log p(w) = n H(s) \qquad (6)$$
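
Equations (5) and (6) can be verified on a toy source string. The eight-word string below is an assumption chosen so that all word frequencies are powers of two, and logarithms are taken to base 2 so that lengths are in bits.

```python
import math
from collections import Counter

def entropy_bits(words):
    # H(s) = -sum_w p(w) log2 p(w), with p(w) the frequency of w in s.
    n = len(words)
    counts = Counter(words)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def total_code_length(words):
    # Sum of the ideal code lengths -log2 p(w) over all n words of s.
    n = len(words)
    counts = Counter(words)
    return sum(-math.log2(counts[w] / n) for w in words)

s = ["a", "a", "a", "a", "b", "b", "c", "d"]  # p = 1/2, 1/4, 1/8, 1/8

H = entropy_bits(s)
L = total_code_length(s)
print("H(s) =", H, "bits per word")   # 1.75
print("total =", L, "bits = n H(s)")  # 14.0
```

Here the words a, b, c, d receive code lengths of 1, 2, 3 and 3 bits, which add up to $8 \cdot 1.75 = 14$ bits for the whole string.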

Let $s$ be the outcome of some random process $W$ that produces the words
$w \in s$ sequentially and independently, each with some known probability
$p(W = w) > 0$. $K(s|W)$ is the Kolmogorov complexity of $s$ given $W$. Because
the Shannon-Fano code is optimal, the probability that $K(s|W)$ is significantly
less than $nH(W)$ is exponentially small. This makes the negative log likelihood
of $s$ given $W$ a good estimator of $K(s|W)$:



$$\begin{aligned}
K(s|W) &\approx n H(W) \\
       &\approx -\sum_{w \in s} \log p(w|W) \\
       &= -\log p(s|W)
\end{aligned} \qquad (7)$$



Relative entropy. The relative entropy $D(p\|q)$ tells us what happens when
we use the wrong probability to encode our source string $s$. If $p(w)$ is the true
distribution over the words of $s$ but we use $q(w)$ to encode them, we end up
with an average of $H(p) + D(p\|q)$ bits per word. $D(p\|q)$ is also called the
Kullback-Leibler distance between the two probability mass functions $p$ and $q$.
It is defined as



$$D(p\|q) = \sum_{w} p(w) \log \frac{p(w)}{q(w)} \qquad (8)$$
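
The claim that coding with the wrong distribution costs $H(p) + D(p\|q)$ bits per word can be checked numerically. The two three-word distributions below are assumptions chosen for easy arithmetic; logarithms are to base 2.

```python
import math

def entropy(p):
    return -sum(pw * math.log2(pw) for pw in p.values() if pw > 0)

def kl_divergence(p, q):
    # D(p||q) = sum_w p(w) log2(p(w) / q(w))
    return sum(pw * math.log2(pw / q[w]) for w, pw in p.items() if pw > 0)

def expected_code_length(p, q):
    # Average bits per word when words follow p but get code lengths -log2 q(w).
    return -sum(pw * math.log2(q[w]) for w, pw in p.items() if pw > 0)

p = {"a": 0.5, "b": 0.25, "c": 0.25}   # true distribution
q = {"a": 0.25, "b": 0.25, "c": 0.5}   # distribution used for the code

print("H(p)      =", entropy(p))                  # 1.5
print("D(p||q)   =", kl_divergence(p, q))         # 0.25
print("bits/word =", expected_code_length(p, q))  # 1.75
```

Note that $D(p\|q)$ is not symmetric in general, which is why calling it a distance is a convenience rather than a statement about metric properties.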



Fisher information. Fisher information was introduced into statistics some 20 
years before C. Shannon introduced information theory [Fis25]. But it was not 
well understood without it. Fisher information is the variance of the score V of 



the continuous parameter space of our models $m_k$. This needs some explanation.
At the beginning of this paper we defined models as binary strings that discretize the
parameter space of some function or probability distribution. For the purpose
of Fisher information we have to temporarily treat a model $m_k$ as a vector in
$\mathbb{R}^k$. And we only consider models where for all samples $s$ the mapping $f_s(m_k)$
defined by $f_s(m_k) = p(s|m_k)$ is differentiable. Then the score $V$ can be defined
as



$$V = \frac{\partial}{\partial m_k} \ln p(s|m_k) = \frac{\frac{\partial}{\partial m_k} p(s|m_k)}{p(s|m_k)} \qquad (9)$$



The score $V$ is the partial derivative of $\ln p(s|m_k)$, a term we are already familiar
with. The Fisher information $J(m_k)$ is



$$J(m_k) = E_{m_k} \left[ \left( \frac{\partial}{\partial m_k} \ln p(s|m_k) \right)^2 \right] \qquad (10)$$



Intuitively, a high Fisher information means that slight changes to the parameters
will have a great effect on $p(s|m_k)$. If $J(m_k)$ is high we must calculate
$p(s|m_k)$ to a high precision. Conversely, if $J(m_k)$ is low, we may round $p(s|m_k)$
to a low precision.
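
A minimal numerical sketch of the score and the Fisher information, assuming a one-parameter model that is not taken from the text: a Gaussian with known standard deviation whose only parameter is the mean. For this model the Fisher information of a single observation is known to be $1/\sigma^2$, so the Monte Carlo estimate can be compared against it. The seed, sample size and step `eps` are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

mu, sigma = 2.0, 0.5                    # assumed model: N(mu, sigma^2), sigma known
x = rng.normal(mu, sigma, 200_000)      # samples drawn from the model itself

def log_p(x, mu):
    # ln p(x|mu) for a Gaussian with fixed standard deviation sigma
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

# Score: derivative of ln p with respect to the parameter (central difference).
eps = 1e-5
score = (log_p(x, mu + eps) - log_p(x, mu - eps)) / (2 * eps)

fisher_estimate = np.mean(score**2)     # E[(d/dmu ln p)^2]
fisher_exact = 1 / sigma**2

print("mean score      :", np.mean(score))  # close to 0
print("Fisher estimate :", fisher_estimate)
print("Fisher exact    :", fisher_exact)
```

A smaller `sigma` makes the likelihood more sharply peaked around the mean and raises the Fisher information, in line with the intuition that the parameter must then be stated more precisely.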

Kolmogorov complexity of sets. The Kolmogorov complexity of a set of
strings $S$ is the length of the shortest program that can output the members
of $S$ on the UTM and then halt. If one is to approximate some string $s$ with
$\alpha < K(s)$ bits then the best one can do is to compute the smallest set $S$ with
$K(S) \le \alpha$ that includes $s$. Once we have some $S \ni s$ we need at most $\log |S|$
additional bits to compute $s$. This set $S$ is defined by the Kolmogorov structure
function

$$h_s(\alpha) = \min_S \left[ \log |S| : S \ni s, \; K(S) \le \alpha \right] \qquad (11)$$

which has many interesting features. The function $h_s(\alpha) + \alpha$ is non increasing
and never falls below the line $K(s) + O(1)$ but can assume any form within these
constraints. It should be evident that

$$h_s(\alpha) \ge K(s) - K(S) \qquad (12)$$

Kolmogorov complexity of distributions. The Kolmogorov structure function
is not confined to finite sets. If we generalize $h_s(\alpha)$ to probabilistic models
$m_p$ that define distributions over $\mathbb{R}$ and if we let $s$ describe a real number, we
obtain

$$h_s(\alpha) = \min_{m_p} \left[ -\log p(s|m_p) : p(s|m_p) > 0, \; K(m_p) \le \alpha \right] \qquad (13)$$

where $-\log p(s|m_p)$ is the number of bits we need to encode $s$ with a code that
is optimal for the distribution defined by $m_p$. Henceforth we will write $m_p$ when
the model defines a probability distribution and $m_k$ with $k \in \mathbb{N}$ when the model
defines a probability distribution that has $k$ parameters. A set $S$ can be viewed
as a special case of $m_p$, a uniform distribution with



p(s\m p ) 



W\ if s e S 
if s # S 



(14) 



Minimum randomness deficiency. The randomness deficiency of a string $s$
with regard to a model $m_p$ is defined as

$$\delta(s|m_p) = -\log p(s|m_p) - K(s|m_p, K(m_p)) \qquad (15)$$



for $p(s|m_p) > 0$, and $\infty$ otherwise. This is a generalization of the definition given
in [VV02] where models are finite sets. If $\delta(s|m_p)$ is small, then $s$ may be
considered a typical or low profile instance of the distribution. $s$ satisfies all
properties of low Kolmogorov complexity that hold with high probability for the
support set of $m_p$. This would not be the case if $s$ were exactly identical
to the mean, first moment or any other special characteristic of $m_p$.

Randomness deficiency is a key concept to any application of Kolmogorov complexity.
As we saw earlier, Kolmogorov complexity and conditional Kolmogorov
complexity are not computable. We can never claim that a particular string $s$
does have a conditional Kolmogorov complexity

$$K(s|m_p) \approx -\log p(s|m_p) \qquad (16)$$



The technical term that defines all those strings that do satisfy this approximation
is typicality, defined as a small randomness deficiency $\delta(s|m_p)$.

Minimum randomness deficiency turns out to be important for lossy data compression.
A compressed string of minimum randomness deficiency is the most
difficult one to distinguish from the original string. The best lossy compression
that uses a maximum of $\alpha$ bits is defined by the minimum randomness deficiency
function

$$\beta_s(\alpha) = \min_{m_p} \left[ \delta(s|m_p) : p(s|m_p) > 0, \; K(m_p) \le \alpha \right] \qquad (17)$$



Minimum Description Length. The Minimum Description Length or short
MDL of a string $s$ is the length of the shortest two-part code for $s$ whose model
uses at most $\alpha$ bits. It consists of the number of bits needed to encode the model
$m_p$ that defines a distribution and the negative log likelihood of $s$ under this
distribution.

$$\lambda_s(\alpha) = \min_{m_p} \left[ -\log p(s|m_p) + K(m_p) : p(s|m_p) > 0, \; K(m_p) \le \alpha \right] \qquad (18)$$

It has recently been shown by Nikolai Vereshchagin and Paul Vitanyi in [VV02]
that a model that minimizes the description length also minimizes the randomness
deficiency, though the reverse may not be true. The most fundamental
result of that paper is the equality

$$\beta_s(\alpha) = h_s(\alpha) + \alpha - K(s) = \lambda_s(\alpha) - K(s) \qquad (19)$$

where the mutual relations between the Kolmogorov structure function, the
minimum randomness deficiency and the minimum description length are pinned
down, up to logarithmic additive terms in argument and value.



MDL minimizes randomness deficiency. With this important 
result established, we are very keen to learn whether MDL 
can minimize the generalization error as well. 



5 Practical MDL 



From 1978 on Jorma Rissanen developed the idea to minimize the generalization
error of a model by penalizing it according to its description length [Ris78].
At that time the only other method that successfully prevented overfitting by
penalization was the Akaike Information Criterion (AIC). The AIC selects the
model $m_k$ according to



$$\mathcal{L}_{\mathrm{AIC}}(s) = \min_k \left[ n \log \sigma_k^2 + 2k \right] \qquad (20)$$



where $\sigma_k^2$ is the mean squared error of the model $m_k$ on the training sample $s$,
$n$ the size of $s$ and $k$ the number of parameters used. H. Akaike introduced the
term $2k$ in his 1973 paper [Aka73] as a penalty on the complexity of the model.



Compare this to Rissanen's original MDL criterion: 



$$\mathcal{L}_{\mathrm{Ris}}(s) = \min_k \left[ -\log p(s|m_k) + k \log \sqrt{n} \right] \qquad (21)$$



Rissanen replaced Akaike's modified error $n \log \sigma_k^2$ by the information theoretically
more correct term $-\log p(s|m_k)$. This is the length of the Shannon-Fano
code for $s$ which is a good approximation of $K(s|m_k)$, the complexity of the data
given the $k$-parameter distribution model $m_k$, typicality assumed.⁵ Further, he
penalized the model complexity not only according to the number of parameters
but according to both parameters and precision. Since statisticians at that
time treated parameters usually as of infinite precision he had to come up with
a reasonable figure for the precision any given model needed and postulated it
to be $\log \sqrt{n}$ per parameter. This was quite a bold assumption but it showed
reasonable results. He now weighted the complexity of the encoded data against



5 For this approximation to hold, $s$ has to be typical for the model $m_k$. See Section 4
for a discussion of typicality and minimum randomness deficiency.
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5 PRACTICAL MDL 



the complexity of the model. The result he rightly called Minimum Descrip- 
tion Length because the winning model was the one with the lowest combined 
complexity or description length. 
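A minimal sketch of Rissanen's criterion for polynomial models follows. It assumes Gaussian noise with the maximum likelihood variance, so that the data term becomes (n/2) log(2πe σ_k²) in nats; this noise model, the synthetic sample and the function names are assumptions of the example, not part of Rissanen's formulation.

```python
import numpy as np

def neg_log_likelihood(x, y, k):
    """-log p(s|m_k) in nats, assuming Gaussian noise with ML variance."""
    coeffs = np.polyfit(x, y, deg=k - 1)
    var = np.mean((np.polyval(coeffs, x) - y) ** 2)
    return 0.5 * len(x) * np.log(2.0 * np.pi * np.e * var)

def rissanen_score(x, y, k):
    """Data cost plus model cost: log(sqrt(n)) nats of precision per parameter."""
    return neg_log_likelihood(x, y, k) + k * np.log(np.sqrt(len(x)))

# Hypothetical sample: a noisy quadratic (three parameters).
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 40)
y = 1.0 - 2.0 * x + 3.0 * x**2 + rng.normal(0.0, 0.1, size=x.size)
scores = {k: rissanen_score(x, y, k) for k in range(1, 9)}
best_k = min(scores, key=scores.get)
```

Unlike the constant penalty 2k of the AIC, the penalty here grows with the sample size n, because larger samples justify a higher parameter precision.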

Rissanen's use of model complexity to minimize the generalization error comes
very close to what Ray Solomonoff originally had in mind when he first developed
Kolmogorov complexity. The maximum a posteriori model according to Bayes'
rule, supplied with Solomonoff's universal distribution, will favor the Minimum
Description Length model, since

$$\begin{aligned}
\max_m \left[ P(m \mid s) \right]
  &= \max_m \left[ \frac{P(s \mid m)\, P(m)}{P(s)} \right] \\
  &= \max_m \left[ P(s \mid m)\, 2^{-K(m)} \right] \\
  &= \min_m \left[ -\log P(s \mid m) + K(m) \right]
\end{aligned} \tag{22}$$
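The steps behind equation (22) can be spelled out in terms of the selected model. This is a sketch that, like the equation itself, takes the universal prior as $P(m) = 2^{-K(m)}$ and uses base 2 logarithms:

```latex
\begin{align*}
\arg\max_m P(m \mid s)
  &= \arg\max_m \frac{P(s \mid m)\, P(m)}{P(s)}
     && \text{Bayes' rule} \\
  &= \arg\max_m P(s \mid m)\, 2^{-K(m)}
     && P(s) \text{ does not depend on } m,\ P(m) = 2^{-K(m)} \\
  &= \arg\min_m \bigl[ -\log_2 P(s \mid m) + K(m) \bigr]
     && -\log_2 \text{ is strictly decreasing}
\end{align*}
```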



Though Rissanen's simple approximation $K(m) \approx k \log \sqrt{n}$ could
compete with the AIC in minimizing the generalization error, the results on
small samples were rather poor. But it is especially the small samples that are
most in need of a reliable method to minimize the generalization error. Most
methods converge to the optimum results as the sample size grows, mainly due to
the law of large numbers, which forces the statistics of a sample to converge to
the statistics of the source. But small samples can have very different
statistics and the big problem of model selection is to estimate how far they
can be trusted.

In general, two-part MDL makes a strict distinction between the theoretical 
complexity of a model and the length of the implementation actually used. All 
versions of two-part MDL follow a three stage approach: 

1. the complexity $-\log p(s \mid m_k)$ of the sample according to each model
$m_k$ is calculated at a high precision of $m_k$.

2. the minimum complexity $K(m_k)$ which would theoretically be needed to
achieve this likelihood is estimated.

3. this theoretical estimate $E[K(m_k)]$ minus the previous
$\log p(s \mid m_k)$ approximates the overall complexity of data and model.



Mixture MDL. More recent versions of MDL look deeper into the complexity
of the model involved. Solomonoff and Rissanen in their original approaches
minimized a two-part code, one code for the model and one code for the sample
given the model. Mixture MDL abandons this approach. We no longer search
for a particular model but for the number of parameters $k$ that minimizes the
total code length $-\log p(s \mid k) + \log k$. To do this, we average
$p(s \mid m_k)$ over all possible models $m_k$ for every number of parameters
$k$, as will be defined further below.

$$\mathcal{L}_{mix}(s) = \min_k \left[ -\log p(s \mid k) + \log k \right] \tag{23}$$



Since the model complexity is reduced to log k which is almost constant and 
has little influence on the results, it is not appropriate anymore to speak of a 
mixture code as a two-part code. 

Let $M_k$ be the $k$-dimensional parameter space of a given family of models and
let $p(M_k = m_k)$ be a prior distribution over the models in $M_k$.⁶ Provided
this prior distribution is defined in a proper way we can calculate the
probability that the data was generated by a $k$-parameter model as

$$p(s \mid k) = \int_{m_k \in M_k} p(m_k)\, p(s \mid m_k)\, dm_k \tag{24}$$

Once the best number of parameters $k$ is found we calculate our model $m_k$ in
the conventional way. This approach is not without problems and the various
versions of mixture MDL differ in how they address them:

• The binary models $m_k$ form only a discrete subset of the continuous
parameter space $M_k$. How are they distributed over this parameter space
and how does this affect the results?

• What is a reasonable prior distribution over $M_k$?

• For most priors the integral goes to zero or infinity. How do we normalize
it?

• The calculations become too complex to be carried out in practice.
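To make the integral for p(s|k) concrete, here is a rough Monte Carlo sketch for polynomial models. Everything specific in it is an assumption chosen for illustration: the standard normal prior over coefficients, the known Gaussian noise level `noise_std`, the number of draws and the function names. It sidesteps rather than solves the problems listed above, and plain prior sampling becomes unreliable as k grows.

```python
import numpy as np

def log_marginal_likelihood(x, y, k, noise_std=0.5, n_draws=20000, seed=0):
    """Monte Carlo estimate of log p(s|k): average p(s|m_k) over prior draws."""
    rng = np.random.default_rng(seed)
    # Hypothetical prior: independent standard normal polynomial coefficients.
    coeffs = rng.normal(0.0, 1.0, size=(n_draws, k))
    # Design matrix with columns 1, x, ..., x^(k-1).
    design = np.vander(x, N=k, increasing=True)
    residuals = y[None, :] - coeffs @ design.T        # shape (n_draws, n)
    log_lik = (-0.5 * np.sum(residuals**2, axis=1) / noise_std**2
               - x.size * np.log(noise_std * np.sqrt(2.0 * np.pi)))
    # Log of the mean likelihood, computed stably (log-sum-exp trick).
    peak = log_lik.max()
    return peak + np.log(np.mean(np.exp(log_lik - peak)))

def mixture_mdl_score(x, y, k, **kw):
    """-log p(s|k) + log k, in nats."""
    return -log_marginal_likelihood(x, y, k, **kw) + np.log(k)

# Hypothetical sample: a noisy straight line (two parameters).
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 20)
y = 2.0 * x + rng.normal(0.0, 0.5, size=x.size)
scores = {k: mixture_mdl_score(x, y, k, noise_std=0.5) for k in (1, 2, 3)}
```

Note that no explicit penalty on the parameters appears: averaging over the prior already lowers p(s|k) for families that spread their probability over many models that fit the sample badly.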



Minimax MDL. Another important extension of MDL is the minimax strategy.
Let $m_k$ be the $k$-parameter model that can best predict $n$ future values
from some i.i.d. training values. Because $m_k$ is unknown, every model
$\hat{m}_k$ that achieves a least square error on the training values will
inflict an extra cost when predicting the $n$ future values. This extra cost is
the Kullback-Leibler distance

$$D(m_k \,\|\, \hat{m}_k) = \sum_{x_i} p(x_i \mid m_k) \log \frac{p(x_i \mid m_k)}{p(x_i \mid \hat{m}_k)}. \tag{25}$$

The minimax strategy favors the model $m_k$ that minimizes the maximum of
this extra cost.

$$\mathcal{L}_{mm} = \min_k \max_{m_k \in M_k} D(m_k \,\|\, \hat{m}_k) \tag{26}$$
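The minimax idea can be sketched on a toy problem with discrete distributions. The finite lists of candidate models and possible true models, and the names `kl_divergence` and `minimax_choice`, are hypothetical simplifications; in the text the maximum ranges over the whole parameter space $M_k$.

```python
import numpy as np

def kl_divergence(p, q):
    """D(p || q) in nats for discrete distributions on the same support."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0                      # terms with p(x) = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def minimax_choice(candidates, possible_truths):
    """Index of the candidate whose worst-case extra cost is smallest."""
    worst = [max(kl_divergence(t, c) for t in possible_truths)
             for c in candidates]
    return int(np.argmin(worst)), worst

# Made-up example: the truth is one of two skewed coins.
truths = [[0.9, 0.1], [0.1, 0.9]]
candidates = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]
choice, worst = minimax_choice(candidates, truths)
```

Here the fair coin wins: it is never the best predictor, but it is the only candidate that is not badly wrong for either possible truth.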

6 For the moment, treat models as vectors in $\mathbb{R}^k$ so that integration
is possible. See the discussion on Fisher information in Section 4 on page 10
for a similar problem.



6 Error minimization 

Any discussion of information theory and complexity would be incomplete without
mentioning the work of Carl Friedrich Gauss (1777-1855). Working on astronomy
and geodesy, Gauss devoted a great deal of research to the question of how to
extract accurate information from physical measurements. Our modern ideas of
error minimization are largely due to his work.

Euclidean distance and mean squared error. To indicate how well a particular
function $f(x)$ can approximate another function $g(x)$ we use the Euclidean
distance or the mean squared error. Minimizing one of them will minimize the
other, so which one is used is a matter of convenience. We use the mean squared
error. For the interval $x \in [a, b]$ it is defined as

$$\sigma^2 = \frac{1}{b-a} \int_a^b \left( f(x) - g(x) \right)^2 dx \tag{27}$$



This formula can be extended to multi-dimensional space. 
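For the one-dimensional case, the mean squared error between two functions can be approximated numerically. The sketch below uses a hand-written trapezoidal rule; the function name and the number of grid points are arbitrary choices for this example.

```python
import numpy as np

def mean_squared_error(f, g, a, b, n_points=10001):
    """Approximate (1/(b-a)) * integral_a^b (f(x) - g(x))^2 dx (trapezoidal rule)."""
    x = np.linspace(a, b, n_points)
    sq = (f(x) - g(x)) ** 2
    dx = (b - a) / (n_points - 1)
    integral = dx * (np.sum(sq) - 0.5 * (sq[0] + sq[-1]))
    return integral / (b - a)

# f(x) = x against g(x) = 0 on [0, 1]: the exact value is 1/3.
mse = mean_squared_error(lambda x: x, lambda x: np.zeros_like(x), 0.0, 1.0)
euclidean_distance = np.sqrt((1.0 - 0.0) * mse)
```

The Euclidean distance is the square root of (b - a) times the mean squared error, which is why minimizing one minimizes the other.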

Often the function that we want to approximate is unknown to us and is only 
witnessed by a sample that is virtually always polluted by some noise. This 
noise includes measurement noise, rounding errors and disturbances during the 
execution of the original function. When noise is involved it is more difficult to 
approximate the original function. The model has to take account of the
distribution of the noise as well. To our great convenience a mean squared error
$\sigma^2$ can also be interpreted as the variance of a Gaussian or normal
distribution. The Gaussian distribution is a very common distribution in nature.
It is also akin to the concept of Euclidean distance, bridging the gap between
statistics and geometry. For sufficiently many points drawn from the
distribution $\mathcal{N}(f(x), \sigma^2)$ the mean squared error between these
points and $f(x)$ will approach $\sigma^2$, and approximating a function that is
witnessed by a sample polluted by Gaussian noise becomes the same as
approximating the function itself.
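A small simulation illustrates the convergence of the mean squared error to the noise variance. The choice of f, the noise level and the sample size are arbitrary assumptions for this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.3                     # standard deviation of the Gaussian noise
f = np.sin                      # stand-in for the unknown original function

x = rng.uniform(0.0, 2.0 * np.pi, size=200_000)
samples = rng.normal(loc=f(x), scale=sigma)     # points drawn from N(f(x), sigma^2)
empirical_mse = np.mean((samples - f(x)) ** 2)  # should be close to sigma^2 = 0.09
```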

Let $a$ and $b$ be two points and let $l$ be the Euclidean distance between
them. A Gaussian distribution $p(l)$ around $a$ will assign the maximum
probability to $b$ if the distribution has a variance that is equal to $l^2$. To
prove this, we take the first derivative of $p(l)$ with respect to $\sigma$ and
set it to zero:

$$\begin{aligned}
\frac{d}{d\sigma}\, p(l)
  &= \frac{d}{d\sigma}\, \frac{1}{\sigma\sqrt{2\pi}}\, e^{-l^2/2\sigma^2} \\
  &= \frac{-1}{\sigma^2\sqrt{2\pi}}\, e^{-l^2/2\sigma^2}
     + \frac{1}{\sigma\sqrt{2\pi}}\, e^{-l^2/2\sigma^2}\, \frac{2\,l^2}{2\,\sigma^3} \\
  &= \frac{1}{\sigma^2\sqrt{2\pi}}\, e^{-l^2/2\sigma^2}
     \left( \frac{l^2}{\sigma^2} - 1 \right) \\
  &= 0
\end{aligned} \tag{28}$$

which leaves us with

$$\sigma^2 = l^2. \tag{29}$$
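The result can be checked numerically by scanning the density at a fixed distance over a grid of standard deviations. The distance 1.7 and the grid are arbitrary values picked for this sketch.

```python
import numpy as np

def gaussian_density(l, sigma):
    """Density of a zero-mean Gaussian with standard deviation sigma at distance l."""
    return np.exp(-l**2 / (2.0 * sigma**2)) / (sigma * np.sqrt(2.0 * np.pi))

l = 1.7
sigmas = np.linspace(0.1, 5.0, 4901)        # grid with step 0.001
best_sigma = sigmas[np.argmax(gaussian_density(l, sigmas))]
```

On this grid the maximizing standard deviation coincides with l up to the grid spacing, in agreement with the derivation.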

Selecting the function $f$ that minimizes the Euclidean distance between $f(x)$
and $g(x)$ over the interval $[a, b]$ is the same as selecting the maximum
likelihood distribution, the distribution $\mathcal{N}(f(x), \sigma^2)$ that
gives the highest probability to the values of $g(x)$.

Maximum entropy distribution. Of particular interest to us is the entropy
of the Gaussian distribution. The optimal code for a value that was drawn
according to a Gaussian distribution $p(x)$ with variance $\sigma^2$ has a mean
code length or entropy of

$$H(P) = -\int_{-\infty}^{\infty} p(x) \log p(x)\, dx = \frac{1}{2} \log(2\pi e \sigma^2) \tag{30}$$

To always assume a Gaussian distribution for the noise may draw some criti- 
cism as the noise may actually have a very different distribution. Here another 
advantage of the Gaussian distribution comes in handy: for a given variance 
the Gaussian distribution is the maximum entropy distribution. It gives the 
lowest log likelihood to all its members of high probability. That means that 
the Gaussian distribution is the safest assumption if the true distribution is 
unknown. Even if it is plain wrong, it promises the lowest cost of a wrong 
prediction regardless of the true distribution [Gru00].
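The maximum entropy property can be illustrated by comparing the Gaussian with two other distributions of the same variance. The sketch uses the standard closed forms for the differential entropy (in nats) of the uniform and Laplace distributions; those two comparison distributions are chosen here as examples and are not discussed in the text.

```python
import numpy as np

def gaussian_entropy(sigma):
    return 0.5 * np.log(2.0 * np.pi * np.e * sigma**2)

def uniform_entropy(sigma):
    # A uniform distribution with variance sigma^2 has width sigma * sqrt(12).
    return np.log(sigma * np.sqrt(12.0))

def laplace_entropy(sigma):
    # A Laplace distribution with variance sigma^2 has scale sigma / sqrt(2).
    return 1.0 + np.log(2.0 * sigma / np.sqrt(2.0))

def numeric_gaussian_entropy(sigma, n_points=200001):
    """Riemann-sum approximation of -integral p(x) log p(x) dx."""
    x = np.linspace(-12.0 * sigma, 12.0 * sigma, n_points)
    p = np.exp(-x**2 / (2.0 * sigma**2)) / (sigma * np.sqrt(2.0 * np.pi))
    dx = x[1] - x[0]
    return float(-np.sum(p * np.log(p)) * dx)

sigma = 2.0
entropies = (gaussian_entropy(sigma), laplace_entropy(sigma), uniform_entropy(sigma))
```

For any fixed variance the three values are ordered Gaussian > Laplace > uniform, and the numerical integral agrees with the closed form of equation (30).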

7 Critical points 

Concluding that a method does or does not select a model close to the optimum 
is not enough for evaluating that method. It may be that the selected model 
is many degrees away from the real optimum but still has a low generaliza- 
tion error. Or it can be very close to the optimum but only one degree away 
from overfitting, making it a risky choice. A good method should have a low
generalization error and be a safe distance away from the models that overfit.

How a method evaluates models other than the optimum model is also impor- 
tant. To safeguard against any chance of overfitting we may want to be on the 
safe side and choose the lowest degree model that has an acceptable generaliza- 
tion error. This requires an accurate estimate of the per model performance. 
We may also combine several methods for model selection and select the model 
that is best on average. This too requires reliable per model estimates. And not 
only the generalization error can play a role in model selection. Speed of compu- 
tation and memory consumption may also constrain the model complexity. To 
calculate the best trade-off between algorithmic advantages and generalization
error we also need accurate per-model performance estimates.



When looking at the example we observe a number of critical points that can 
help us to evaluate a method: 

the origin: the generalization error when choosing the simplest model pos- 
sible. For polynomials this is the expected mean of y ignoring x. 

the initial region: may contain a local maximum slightly worse than the 
origin or a plateau where the generalization error is almost constant. 

the region of good generalization: the region that surrounds the opti- 
mum and where models perform better than half way between origin 
and optimum. Often the region of good generalization is visible in 
the generalization analysis as a basin with sharp increase and de- 
crease at its borders and a flat plateau in the center where a number 
of competing minima are located. 

the optimum model: the minimum within the region of good general- 
ization. 

false minima: models that show a single low generalization error but lie 
outside or at the very edge of the region of good generalization. 

overfitting: from a certain number of degrees on all models have a gener- 
alization error worse than the origin. 
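The definitions above can be turned into a small helper that locates the optimum, the region of good generalization and the false minima on a curve of generalization errors. The sketch assumes the errors are listed by increasing model degree with the origin at index 0; the error values are made up for illustration and the function name is hypothetical.

```python
import numpy as np

def region_of_good_generalization(errors):
    """Return (optimum, first, last, false_minima) as indices into errors."""
    errors = np.asarray(errors, dtype=float)
    optimum = int(np.argmin(errors))
    # "Better than half way between origin and optimum".
    threshold = 0.5 * (errors[0] + errors[optimum])
    first = last = optimum
    while first > 0 and errors[first - 1] < threshold:
        first -= 1
    while last < len(errors) - 1 and errors[last + 1] < threshold:
        last += 1
    # Low errors outside the contiguous region around the optimum.
    false_minima = [i for i, e in enumerate(errors)
                    if e < threshold and not first <= i <= last]
    return optimum, first, last, false_minima

# Made-up generalization errors for degrees 0..9.
errors = [1.0, 1.05, 0.6, 0.3, 0.25, 0.28, 0.7, 0.27, 3.0, 40.0]
optimum, first, last, false_minima = region_of_good_generalization(errors)
```

In this made-up curve the region spans degrees 2 to 5 around the optimum at degree 4, and the isolated low error at degree 7 is reported as a false minimum.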

Let us give some more details about three important features: 

Region of good generalization. The definition of the region of good gen- 
eralization as better than half way between origin and optimum needs some 
explanation. Taking an absolute measure is useless as the error can be of any 
magnitude. A relative measure is also useless because while in one case origin 
and optimum differ only by 5 percent with many models in between, in another 
case even the immediate neighbors might show an error two times worse than 
the optimum but still much better than the origin. 

Better than half way between origin and optimum may seem a rather weak 
constraint. With big enough samples all methods might eventually agree on 
this region and it may become obsolete. But we are primarily concerned with
small samples. And as a rule of thumb, a method that cannot fulfill a weak 
constraint will be bad on stronger constraints as well. 

False minima. The false minima are another point that deserves attention.
While different samples of the same size will generally agree on the generalization 
error at the origin, the initial region, the region of good generalization and the 
region of real overfitting, the false minima will change place, disappear and pop 
up again at varying amplitudes. They can even outperform the real optimum. 
The reason for this can lie in rounding errors during calculations and in the 
random selection of points for the training set. And even though the training 
sample might miss important features of the source due to its restricted size, the 
model might hit on them by sheer luck, thus producing an exceptional minimum. 
Cross-validation particularly suffers from false minima and has to be smoothed
before being useful. Taking the mean over three adjacent values has proven to
be a sufficient cure. Both versions of MDL seem to be rather free of them.
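Smoothing by the mean over three adjacent values can be sketched as follows. How the two end points are treated is not stated in the text; repeating the edge value is an assumption of this example, and the error values are made up.

```python
import numpy as np

def smooth3(errors):
    """Mean over each value and its two neighbors; edge values are repeated."""
    errors = np.asarray(errors, dtype=float)
    padded = np.pad(errors, 1, mode="edge")
    return (padded[:-2] + padded[1:-1] + padded[2:]) / 3.0

# Made-up cross-validation errors with an isolated dip at index 2.
cv_errors = [1.0, 0.9, 0.2, 0.8, 0.5, 0.4, 0.45, 0.9]
raw_choice = int(np.argmin(cv_errors))
smoothed_choice = int(np.argmin(smooth3(cv_errors)))
```

The isolated dip wins on the raw values, but after smoothing the selection moves into the broader basin, where neighboring models agree.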

Point of real overfitting. The point where overfitting starts also needs some 
explanation. It is tempting to define every model larger than the optimum 
model as overfitted and indeed, this is often done. But such a definition
creates a number of problems. First, the global optimum is often contained in a
large basin with several other local optima of almost equal generalization error. 
Although we assume that the training sample carries reliable information on 
the general outline of generalization error, we have no evidence that it carries 
information on the exact quality of each individual model. It would be wrong 
to judge a method as overfitting because it selected a model 20 degrees too 
high but of a low generalization error if we have no indication that the training 
sample actually contained that information. On the other hand, at the point 
on the x-axis where the generalization error becomes forever worse than at the origin,
the generalization error usually shows a sharp increase. From this point on dif- 
ferences are measured in orders of magnitude and not in percent which makes 
it a much clearer boundary. Also, if smoothing is applied, different forms of 
smoothing will favor different models within the region of good generalization 
while they have little effect on the location of the point where the generalization 
error gets forever worse. And finally, even if a method systematically misses the 
real optimum, as long as it consistently selects a model well within the region 
of good generalization of the generalization analysis it will lead to good results. 
But selecting or even encouraging a model beyond the point where the error 
gets forever worse than at the origin is definitely unacceptable. 
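Under this definition the point of real overfitting is easy to locate on a curve of generalization errors. The sketch below assumes the errors are listed by increasing degree with the origin at index 0; the function name and the error values are made up for illustration.

```python
import numpy as np

def point_of_real_overfitting(errors):
    """First index from which every error exceeds errors[0]; None if there is none."""
    errors = np.asarray(errors, dtype=float)
    worse = errors > errors[0]
    if not worse[-1]:
        return None
    # Last index that is NOT worse than the origin (index 0 never is).
    last_ok = int(np.max(np.nonzero(~worse)[0]))
    return last_ok + 1

# Made-up generalization errors for degrees 0..9.
errors = [1.0, 1.05, 0.6, 0.3, 0.25, 0.28, 0.7, 0.27, 3.0, 40.0]
overfit_from = point_of_real_overfitting(errors)
```

The local maximum at degree 1, which belongs to the initial region, is ignored because later models recover; only degree 8 onward counts as real overfitting.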